Machine Learning for Naive Bayesian Spam Filter Tokenization
Abstract
Background
Traditional client-level spam filters rely on rule-based heuristics. While these filters can be effective, they have several limitations. The rules must be created by hand, which requires the filter's creator to examine a corpus of spam and cull out its characteristics. This is a time-consuming process, and it is easy to miss rules that are quite effective at detecting spam. While the word "Viagra" is obviously a very good indicator of spam, other, non-obvious words are just as good. In the spam corpus I used, which will be described below, the word "webmaster" had a 0.98 probability of occurring in a spam, as did "ff6600", the hex value for a light orange. Even assuming a good ruleset is created, heuristic filters are still a static detection method. As spam changes, the rules need to be tweaked and new rules need to be added. Worse yet, the rules are the same for everybody who uses a given filter. A spammer can simply run his new spam through popular filters before he sends it out to make sure it will get through.
Statistical filtering seeks to correct these problems [3]. A statistical filter computes the probability that a given e-mail is a spam instead of relying on somewhat arbitrary rules. To do so, a corpus of spam and a corpus of non-spam e-mails are built. Each corpus is then tokenized, and the probability that any given token occurred in a spam is computed. Using this information, it is possible to compute the overall probability that a new e-mail is a spam in a variety of ways. While probabilistic filters help to filter spam more efficiently, they introduce problems of their own. Definitions for tokenization must still be worked out, and this is no trivial task. This paper explores machine learning techniques for tokenization definitions.
Goals
I aimed to design and implement two different types of machine learning methods for tokenization. The first was a hill climbing algorithm [4] and the second was a genetic algorithm [2]. As a baseline for comparison, I also implemented a filter based on a very simple tokenization method. My goal was to achieve better spam classification than the simple tokenization method, that is, more spams detected with fewer false positives (non-spams classified as spams).
Methods
The filters relied on a probabilistic method known as Bayesian filtering. Bayesian filtering uses a naive version of Bayes' rule to compute the overall probability that an e-mail is a spam, assuming that the individual probabilities of each token appearing in a spam are independent of one another.
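The statistical filtering steps described above (tokenize both corpora, estimate a per-token spam probability, then combine the token probabilities under the independence assumption) can be illustrated with a minimal sketch. It is not the filter from this paper: the whitespace tokenizer, the function names, the clamping bounds of 0.01 and 0.99, the neutral value of 0.5 for unseen tokens, and the assumption of equal prior probabilities for spam and non-spam are all illustrative choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical baseline tokenizer: lowercase, split on whitespace,
    # and keep each distinct token once per message.
    return set(text.lower().split())


def train(spam_docs, ham_docs):
    # Count how many messages in each corpus contain each token.
    spam_counts = Counter(tok for doc in spam_docs for tok in tokenize(doc))
    ham_counts = Counter(tok for doc in ham_docs for tok in tokenize(doc))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[tok] / len(spam_docs)
        ham_rate = ham_counts[tok] / len(ham_docs)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so that no single token yields absolute certainty
        # (the bounds are an assumption of this sketch).
        probs[tok] = min(0.99, max(0.01, p))
    return probs


def spam_probability(text, probs, unknown=0.5):
    # Naive Bayes combination with equal priors:
    #   P = prod(p_i) / (prod(p_i) + prod(1 - p_i))
    # computed in log space to avoid underflow on long messages.
    log_spam = 0.0
    log_ham = 0.0
    for tok in tokenize(text):
        p = probs.get(tok, unknown)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:   # guard against overflow in math.exp
        return 0.0
    if diff < -700:
        return 1.0
    return 1.0 / (1.0 + math.exp(diff))


spam_corpus = ["buy viagra now", "cheap viagra offer"]
ham_corpus = ["meeting notes attached", "lunch now please"]
token_probs = train(spam_corpus, ham_corpus)
score = spam_probability("viagra now", token_probs)
```

In this sketch the tokenizer is the only component that would change when a different tokenization definition is tried, which is the part of the filter the machine learning methods in this paper search over.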
Similar resources
Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
We investigate the performance of two machine learning algorithms in the context of antispam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an ...
A New Approach to Spam Mail Detection
The ever-increasing menace of spam is bringing down productivity. More than 70% of e-mail messages are spam, and it has become a challenge to separate such messages from the legitimate ones. I have developed a spam identification engine which employs a naive Bayesian classifier to identify spam. A new concept-based mining model that analyzes terms on the sentence, document is introduced. The...
Machine Learning Techniques in Spam Filtering
The article gives an overview of some of the most popular machine learning methods (Bayesian classification, k-NN, ANNs, SVMs) and of their applicability to the problem of spam-filtering. Brief descriptions of the algorithms are presented, which are meant to be understandable by a reader not familiar with them before. A most trivial sample implementation of the named techniques was made by the ...
Machine Learning methods for E-mail Classification
The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable antispam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper we review some of the most popular machine learning methods (Bayesian classification, k-NN, ANNs, SVMs, Artificial immune system...
Not So Naive Online Bayesian Spam Filter
Spam filtering, as a key problem in electronic communication, has drawn significant attention due to increasingly huge amounts of junk email on the Internet. Content-based filtering is one reliable method in combating with spammers changing tactics. Naïve Bayes (NB) is one of the earliest content-based machine learning methods both in theory and practice in combating with spammers, which is eas...